Module 1 In-Class Activities

PLS 206 — Applied Multivariate Modeling

Author

Grey Monroe

Published

January 1, 2026

Important

No complete code is provided. Each activity gives you a starting point — you write the rest. There are no wrong answers, only things to discover.

Not all activities are meant for a single class session — pick based on time and where the class needs to slow down or speed up.


Activity 1 · Meet Your Data

Time: 10 minutes · Goal: Before you model anything, you need to know your data. This activity builds the habit of looking before fitting.

The dataset

wine <- read.csv("winedata.csv")

Your tasks

Work through these in order. Each one should be a single line of R.

1. How big is the dataset?

# How many rows and columns?

2. What are the variables?

# Show variable names, types, and a few values

3. Are there any missing values?

# Check for NAs — how many per column?

4. What does the outcome variable look like?

The column quality is the wine quality score (0–10). Get its summary statistics and make a quick plot of its distribution.

# Summary statistics for quality

# Plot its distribution

5. Which variable is most correlated with quality?

Compute the correlation of each numeric predictor with quality. You can do this one at a time or with cor() on the whole data frame.

# Your code here

6. Write one sentence:

“Based on the data, I think the best single predictor of wine quality is _______ because _______.”
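If you get stuck, here is one possible set of one-liners to compare against afterward — a sketch, not the only right answers (the last line assumes every column of `wine` is numeric; if not, subset to the numeric columns first):

```r
dim(wine)               # 1. rows and columns
str(wine)               # 2. names, types, and a few values
colSums(is.na(wine))    # 3. NAs per column
summary(wine$quality)   # 4a. summary statistics for the outcome
hist(wine$quality)      # 4b. quick distribution plot
cor(wine)[, "quality"]  # 5. correlation of every column with quality
```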

Discuss

  • Did everyone pick the same predictor?
  • Does correlation = causation here?
  • What would you want to know about how quality was measured?

Activity 2 · Vectors & Data Frames from Scratch

Time: 8 minutes · Goal: Build a data frame by hand, manipulate it, and make a plot — no CSV, no built-in dataset.

Setup

Imagine you measured plant height (cm) for two treatments — watered normally and drought-stressed — with 5 plants each:

Plants   Treatment   Heights (cm)
1–5      Control     22, 18, 25, 21, 19
6–10     Drought     14, 11, 16, 10, 13

Your tasks

1. Create a data frame called plants with columns plant_id, treatment, and height.

# Build it with c() and data.frame()

2. Add a new column height_m that converts height to meters.

# Add column using $

3. Subset to only drought plants and compute their mean height.

# Subset, then mean()

4. Make a boxplot comparing the two treatments using base R boxplot() or ggplot2.

library(ggplot2)
# Your plot

5. Run a t-test. Is the height difference significant?

# t.test()

Extension: Add a third treatment (heat_stress, heights: 16, 12, 18, 11, 15) and redo the comparison with ANOVA.
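One possible construction, for checking your own version against (the formula interface used in the last two lines is one of several equally good approaches):

```r
# 1. Build the data frame by hand
plants <- data.frame(
  plant_id  = 1:10,
  treatment = rep(c("Control", "Drought"), each = 5),
  height    = c(22, 18, 25, 21, 19, 14, 11, 16, 10, 13)
)

# 2. Convert cm to m in a new column
plants$height_m <- plants$height / 100

# 3. Mean height of the drought plants only
mean(plants$height[plants$treatment == "Drought"])

# 4. Compare treatments visually, then 5. formally
boxplot(height ~ treatment, data = plants)
t.test(height ~ treatment, data = plants)
```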


Activity 3 · The Anscombe Quartet

Time: 10 minutes · Goal: Prove to yourself that summary statistics lie. Always plot your data.

Step 1 — summarize without plotting

R has the Anscombe Quartet built in. Start by computing summary statistics only — no plots yet.

data(anscombe)
head(anscombe)

For each of the four (x, y) pairs (x1/y1 through x4/y4), compute:

  • Mean of x and y
  • SD of x and y
  • Correlation between x and y
  • Slope from lm(y ~ x)

# e.g. mean(anscombe$x1), cor(anscombe$x1, anscombe$y1), etc.

Write down the four sets of results. What do you notice?
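Rather than typing each statistic sixteen times, one way to collect all four columns at once is a short loop — a sketch of what your summary table might look like:

```r
data(anscombe)

# For each pair i = 1..4, pull x_i and y_i and compute the same statistics
sapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  c(mean_x = mean(x), mean_y = mean(y),
    sd_x   = sd(x),   sd_y   = sd(y),
    r      = cor(x, y),
    slope  = unname(coef(lm(y ~ x))[2]))
})
```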

Step 2 — now plot them

Make a 2×2 panel of scatterplots, one per dataset, each with a regression line.

library(ggplot2)
library(tidyr)
library(dplyr)

# Reshape to long format
# Hint: you want columns: dataset, x, y

# Then:
# ggplot(long_df, aes(x, y)) +
#   geom_point() +
#   geom_smooth(method = "lm", se = FALSE) +
#   facet_wrap(~dataset)

Tip

If reshaping is taking too long, just make 4 separate plots with base R: plot(anscombe$x1, anscombe$y1) etc.
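If you want to see one tidyr route through the reshape, the `names_pattern` + `.value` trick splits the `x1 … y4` column names into an x/y pair plus a dataset label (this is one approach among several, not the required one):

```r
library(ggplot2)
library(tidyr)
library(dplyr)

# Split names like "x3" into a value column ("x") and a dataset id ("3")
long_df <- anscombe |>
  mutate(obs = row_number()) |>
  pivot_longer(-obs,
               names_to      = c(".value", "dataset"),
               names_pattern = "([xy])(\\d)")

ggplot(long_df, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  facet_wrap(~dataset)
```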

Discuss

  • Which dataset violates the assumptions of linear regression most severely?
  • Which one has an outlier that’s completely controlling the slope?
  • What would you do differently for each dataset if this were your real data?

Activity 4 · Spot the Bug

Time: 8 minutes · Goal: Build debugging instincts. Reading error messages is a skill — practice it.

The script

This script has 5 bugs. Find them all without running the code first. Then run it and fix them one at a time.

# Load data
data(iris)

# Compute mean sepal length by species
tapply(iris$Sepal.Length, iris$species, mean)

# Subset to setosa only
setosa <- iris[iris$Species = "setosa", ]

# Run a t-test: does setosa sepal length differ from 5.0?
t.test(setosa$Sepal.Length, mu = 5, alternative = "two.tailed")

# Correlation between sepal length and petal length
cor_result <- cor.test(iris$Sepal.Length, iris$Petal.Length)
cat("r =", round(cor_result$estimate, 3),
    "p =", round(cor_result$pvalue, 4))

# Plot
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point

The bugs involve: column name case, comparison operator, argument string value, object element name, and a missing function call.

After you fix it

  • What was the hardest bug to spot? Why?
  • Which error message was most helpful? Which was most cryptic?
  • What habit would have prevented each bug?
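Once you think you have all five, a corrected script might look like the following (each comment flags the fix; check yours against it, not the other way around):

```r
data(iris)

# Bug 1: column names are case-sensitive — Species, not species
tapply(iris$Sepal.Length, iris$Species, mean)

# Bug 2: comparison uses ==, not the assignment-like =
setosa <- iris[iris$Species == "setosa", ]

# Bug 3: the argument value is "two.sided", not "two.tailed"
t.test(setosa$Sepal.Length, mu = 5, alternative = "two.sided")

# Bug 4: the element is p.value, not pvalue
cor_result <- cor.test(iris$Sepal.Length, iris$Petal.Length)
cat("r =", round(cor_result$estimate, 3),
    "p =", round(cor_result$p.value, 4))

# Bug 5: geom_point needs () to actually be called
library(ggplot2)
ggplot(iris, aes(x = Sepal.Length, y = Petal.Length, color = Species)) +
  geom_point()
```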

Activity 5 · Interpret the Output

Time: 8 minutes · Goal: Read statistical output cold — no code, no context, just results. This is what peer review looks like.

Output A

    Welch Two Sample t-test

t = 3.142, df = 37.6, p-value = 0.003
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 0.821  3.679
sample estimates:
mean of x mean of y
    12.45      10.20

Answer without looking anything up:

  1. What is the null hypothesis being tested?
  2. What is your conclusion at α = 0.05?
  3. What is the estimated difference between groups?
  4. Is the effect large or small? What’s missing that would help you answer that?

Output B

    Pearson's product-moment correlation

t = 8.943, df = 58, p-value < 0.001
95 percent confidence interval:
 0.621  0.843
sample estimates:
      cor
0.7612

Answer:

  1. Is this a strong, moderate, or weak correlation?
  2. What does the confidence interval tell you that the p-value doesn’t?
  3. A student writes: “x causes y with 76% probability.” What’s wrong with this?
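Outputs of this shape come from plain `t.test()` and `cor.test()` calls. If you want to generate your own to practice on, something like this works (made-up data, so your numbers will differ from Outputs A and B):

```r
set.seed(1)  # arbitrary seed, just for reproducibility

# A Welch two-sample table like Output A
x <- rnorm(20, mean = 12)
y <- rnorm(20, mean = 10)
t.test(x, y)

# A Pearson correlation table like Output B
a <- rnorm(60)
b <- 0.8 * a + rnorm(60, sd = 0.6)
cor.test(a, b)
```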

Activity 6 · The Confidence Interval Game

Time: 8 minutes · Goal: Make “95% confidence” real. You’ll generate 100 confidence intervals and count how many contain the true mean.

Your task

Fill in the blanks and run the code:

set.seed(...)    # pick any number — don't tell your neighbor

true_mean <- 10
true_sd   <- 3
n         <- 30  # sample size per interval
n_sims    <- 100

results <- replicate(n_sims, {
  samp <- rnorm(..., mean = ..., sd = ...)
  ci   <- t.test(samp)$conf.int
  c(lower = ci[1], upper = ci[2])
})

results           <- as.data.frame(t(results))
results$captured  <- results$lower < true_mean & results$upper > true_mean
sum(results$captured)
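If you want to check your fill-ins afterward, one completed version looks like this (the seed shown is arbitrary; the whole point is that everyone picks a different one):

```r
set.seed(42)  # arbitrary — use your own number

true_mean <- 10
true_sd   <- 3
n         <- 30   # sample size per interval
n_sims    <- 100

results <- replicate(n_sims, {
  samp <- rnorm(n, mean = true_mean, sd = true_sd)
  ci   <- t.test(samp)$conf.int
  c(lower = ci[1], upper = ci[2])
})

results          <- as.data.frame(t(results))
results$captured <- results$lower < true_mean & results$upper > true_mean
sum(results$captured)   # usually somewhere near 95
```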

Plot them

results$sim <- 1:n_sims

plot(range(c(results$lower, results$upper)), c(1, n_sims),
     type = "n", xlab = "Value", ylab = "Simulation",
     main = paste0(sum(results$captured), " / 100 intervals captured true mean"))
abline(v = true_mean, lwd = 2, lty = 2)
for (i in 1:n_sims) {
  segments(results$lower[i], i, results$upper[i], i,
           col = ifelse(results$captured[i], "steelblue", "red"), lwd = 1.2)
}

Discuss

  • How many did you get? Shout it out.
  • What does the spread of answers across the class tell you?
  • Now change n to 10, then 100. What changes and what doesn’t?